Managing Large-scale Probabilistic Databases

نویسندگان

  • Christopher Ré
  • Dan Suciu
  • Magdalena Balazinska
  • Anna Karlin
چکیده

Managing Large-scale Probabilistic Databases Christopher Ré Chair of the Supervisory Committee: Professor Dan Suciu Computer Science & Engineering Modern applications are driven by data, and increasingly the data driving these applications are imprecise. The set of applications that generate imprecise data is diverse: In sensor database applications, the goal is to measure some aspect of the physical world (such as temperature in a region or a person’s location). Such an application has no choice but to deal with imprecision, as measuring the physical world is inherently imprecise. In data integration, consider two databases that refer to the same set of real-world entities, but the way in which they refer to those entities is slightly different. For example, one database may contain an entity ‘J. Smith’ while the second database refers to ‘John Smith’ . In such a scenario, the large size of the data makes it too costly to manually reconcile all references in the two databases. To lower the cost of integration, state-of-the-art approaches allow the data to be imprecise. In addition to applications which are forced to cope with imprecision, emerging data-driven applications, such as large-scale information extraction, natively produce and manipulate similarity scores. In all these domains, the current state-of-the-art approach is to allow the data to be imprecise and to shift the burden of coping with imprecision to applications. The thesis of this work is that it is possible to effectively manage large, imprecise databases using a generic approach based on probability theory. The key technical challenge in building such a general-purpose approach is performance, and the technical contributions of this dissertation are techniques for efficient evaluation over probabilistic databases. In particular, we demonstrate that it is possible to run complex SQL queries on tens of gigabytes of probabilistic data with performance that is comparable to a standard relational database engine.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Probabilistic Databases ∗ Dan

Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model and present some fundamental theoretical results for query evaluation on probabilistic databases. 1 The Quest for Probabilistic Databases Comm...

متن کامل

Scalable Statistical Modeling and Query Processing over Large Scale Uncertain Databases

Title of Dissertation: SCALABLE STATISTICAL MODELING AND QUERY PROCESSING OVER LARGE SCALE UNCERTAIN DATABASES Bhargav Kanagal Shamanna Doctor of Philosophy, 2011 Dissertation directed by: Dr. Amol Deshpande Dept. of Computer Science The past decade has witnessed a large number of novel applications that generate imprecise, uncertain and incomplete data. Examples include monitoring infrastructu...

متن کامل

Managing Uncertainty in Social Networks

Social network analysis (SNA) has become a mature scientific field over the last 50 years and is now an area with massive commercial appeal and renewed research interest. In this paper, we argue that new methods for collecting social nework strucuture, and the shift in scale of these networks, introduces a greater degree of imprecision that requires rethinking on how SNA techniques can be appli...

متن کامل

Chapter 4 GRAPHICAL MODELS FOR UNCERTAIN DATA

Graphical models are a popular and well-studied framework for compact representation of a joint probability distribution over a large number of interdependent variables, and for efficient reasoning about such a distribution. They have been proven useful in a wide range of domains from natural language processing to computer vision to bioinformatics. In this chapter, we present an approach to us...

متن کامل

Graphical Models for Uncertain Data

Graphical models are a popular and well-studied framework for compact representation of a joint probability distribution over a large number of interdependent variables, and for efficient reasoning about such a distribution. They have been proven useful in a wide range of domains from natural language processing to computer vision to bioinformatics. In this chapter, we present an approach to us...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009